Learning value functions from samples of the World Model

1 Introduction

The value-based methods presented previously all presuppose a known world model, i.e., a known transition probability distribution. In most RL applications, especially in the real world, the world model is not known, so we must instead estimate value functions from the sampled data the agent gathers by interacting with the environment.

Two well-established approaches address this: Monte Carlo control and Temporal Difference (TD) learning.

2 Monte Carlo control

Monte Carlo control consists of two alternating steps (a minimal code sketch follows the list):

1. Policy evaluation: run episodes with the current policy and estimate \( Q(s, a) \) by averaging the sampled returns observed after each visit to \( (s, a) \).
2. Policy improvement: make the policy greedy (in practice \( \epsilon \)-greedy, to keep exploring) with respect to the current estimate of \( Q \).
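
As a concrete illustration, here is a minimal sketch of every-visit Monte Carlo control with an \( \epsilon \)-greedy policy. The environment interface (a Gym-style `env.reset()` and `env.step(action)` returning `(next_state, reward, done, info)`), the function name, and the hyperparameter values are assumptions made for this example, not a prescribed implementation.

```python
import random
from collections import defaultdict

# Sketch of every-visit Monte Carlo control with an epsilon-greedy policy.
# Assumptions: `env` follows a Gym-style API and states are hashable;
# hyperparameter values are illustrative.
def mc_control(env, n_actions, episodes=10_000, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)     # action-value estimates
    counts = defaultdict(lambda: [0] * n_actions)  # visit counts for running averages

    def policy(state):
        # Epsilon-greedy with respect to the current Q estimate.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(episodes):
        # Step 1 (sampling): generate an episode with the current policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Step 2 (evaluation): average the sampled returns, computed backwards.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            counts[state][action] += 1
            # Incremental average: Q <- Q + (G - Q) / N.
            Q[state][action] += (G - Q[state][action]) / counts[state][action]
        # Policy improvement is implicit: `policy` always acts greedily
        # (up to epsilon) with respect to the latest Q.

    return Q
```

Because the policy is always derived from the latest \( Q \), evaluation and improvement are interleaved after every episode rather than run to convergence separately.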

3 Temporal Difference (TD) Learning

Temporal Difference learning reduces the Bellman error at sampled states (or state–action pairs), one transition at a time:

\[ V(s_t) \leftarrow V(s_t) + \eta \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right], \]

where \( V(s_t) \) is moved toward the target value \( r_t + \gamma V(s_{t+1}) \), with step size (learning rate) \( \eta \).
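
The update above can be turned into tabular TD(0) policy evaluation with a few lines of code. The sketch below assumes the same Gym-style environment as before, plus a `policy` callable mapping states to actions; the step size `eta` and discount `gamma` are illustrative.

```python
from collections import defaultdict

# Sketch of tabular TD(0) policy evaluation.
# Assumptions: Gym-style `env`, a `policy(state) -> action` callable,
# hashable states; eta and gamma are illustrative.
def td0_evaluate(env, policy, episodes=5_000, eta=0.1, gamma=0.99):
    V = defaultdict(float)  # state-value estimates, initialized to 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target: r_t + gamma * V(s_{t+1}); no bootstrap past a terminal state.
            target = reward + (0.0 if done else gamma * V[next_state])
            # Move V(s_t) toward the target by eta times the TD error.
            V[state] += eta * (target - V[state])
            state = next_state

    return V
```

Unlike the Monte Carlo version, the estimate is updated at every transition because the target bootstraps from \( V(s_{t+1}) \) instead of waiting for the full return.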

4 Combining Monte Carlo and TD

Monte Carlo and TD are often combined. TD updates at each transition, whereas MC waits until an episode finishes (or the horizon is effectively reached). We can keep the TD-style update at each transition and approximate the remainder of the trajectory with a geometrically weighted average of \( n \)-step returns, controlled by a parameter \( \lambda \in [0,1] \):

\[ G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}. \]

where \( G_{t:t+n} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) \) is the \( n \)-step return. \( G_t^{\lambda} \) is the lambda-return, and it replaces the one-step target \( r_t + \gamma V(s_{t+1}) \) in subsequent TD updates. Notable cases:

- \( \lambda = 0 \): only the one-step return \( G_{t:t+1} = r_t + \gamma V(s_{t+1}) \) contributes, recovering the TD(0) target.
- \( \lambda = 1 \): the lambda-return reduces to the full sampled return, i.e., the Monte Carlo target (for episodic tasks).

A short sketch of computing \( G_t^{\lambda} \) for a finished episode follows the list.
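
Below is a minimal, offline (forward-view) sketch of computing \( G_t^{\lambda} \) for a finished episode. It uses the recursion \( G_t^{\lambda} = r_t + \gamma\left[(1-\lambda)\,V(s_{t+1}) + \lambda\, G_{t+1}^{\lambda}\right] \), which is equivalent to the weighted sum above. The function name and inputs (`rewards` for \( r_t \), `next_values` for \( V(s_{t+1}) \)) are assumptions made for the example.

```python
# Sketch: compute lambda-returns for a finished episode of length T.
# Assumptions: rewards[t] = r_t and next_values[t] = V(s_{t+1});
# pass next_values[T-1] = 0.0 if the episode ends in a terminal state.
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    T = len(rewards)
    returns = [0.0] * T
    # Beyond the last step, fall back to the bootstrap value V(s_T).
    next_return = next_values[-1]
    # Backward pass: G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * next_return)
        next_return = returns[t]
    return returns
```

The backward recursion computes every \( G_t^{\lambda} \) in one pass instead of summing the infinite series directly; each \( G_t^{\lambda} \) can then stand in for the one-step target in the TD update.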

TD policy learning can be done on-policy (SARSA) or off-policy (Q-learning). We’ll expand on both and provide simple implementations in the next article.